INFORMATION  ANALYSIS


RESULTS  OF  THE  EXPERIMENTAL  CHECK  OF  SOME 
METHODS  OF  AUTOMATIC  CONSTRUCTION  OF 
CLASSIFICATIONS  AND  INDEXING 

S.  A.  Karasev 

The  subject  of  scientific-information  activity  is  the  scientific 
information  which  represents  the  logical  information  organized  by 
means  of  comparison  and  classification  of  data.  This  circumstance 
requires  the  classification  of  the  documents  containing  the  scientific 
information [IPYa, information retrieval language].

The main disadvantage of such IPYa is the shortage of vocabulary,
which is connected with the impossibility of foreseeing future
changes in informational requirements. However, this deficiency
can be eliminated if the classification is rapidly reconstructed
at sufficiently short intervals. Such a reconstruction is no
problem if the classifications are constructed by machine.
Automatically constructed classifications can be suitable for
currently developing subject fields, for which logical classifications
have not yet been developed, and can be more rational than the
traditional systems of classification.

In the known algorithms (strategies) for the automatic construction
of classifications, statistical methods that completely disregard
the semantic structure of the documents are used exclusively. However,
statistical methods can be considerably reinforced by the application
of syntactic analysis.

In spite of this, the following hypothesis was adopted as the basis
for this work: statistical methods for the automatic construction of
classifications, based on improved criteria, will make it possible to
obtain better results than the known results obtained under the same
assumptions.

In this case either classifications of terms or classifications
of documents can be constructed. For our purposes we use the following
working definition: a system of classification is a totality of sets
of terms, bound by the relationship of coordination or by the
relationships of coordination and subordination, each of which
satisfies a certain criterion of semantic similarity of its
elements to each other.

1.  THE  FUNDAMENTAL  CHARACTERISTICS  OF  THE 
CLASSIFYING  STRATEGIES 

The problem of constructing a classification of terms in the space
of their signs, which under the hypothesis accepted by us are the
frequencies of use of the terms in the documents ($i = 1, \ldots, N$),
consists of dividing the $N$-dimensional space into $m$ regions
$G_1, G_2, \ldots, G_m$, each of which is a class of terms.

If the system of classification has already been constructed,
then the individual signs of each of the classified terms are replaced
by the values of the signs measured for the appropriate classes. With
such a recoding, the more uniform the classes of terms are in their
properties, the less information is lost.

This circumstance makes it possible to consider as the fundamental
characteristics of the classifying strategies the criterion of
semantic similarity between terms, the definition of class, and the
type and degree of freedom of the strategy.

The  view  of  natural  language  of  scientific  documents  as  a 
statistical  phenomenon  makes  it  possible  to  express  quantitatively 
the  semantic  similarity  between  terms  on  the  basis  of  the  application
of  a  certain  statistical  function.  In  spite  of  the  distinction  in  the 
analytical  methods  utilized  for  the  representation  of  the  criterion 
of  semantic  similarity,  they  all  have  a  common  logical  base  which 
consists  of  the  following. 

The  bases  for  the  terminology  of  any  subject  field  are  the 
meaning-bearing  words  which  possess  a  naming  (nominative)  function. 
Meaning-bearing  words  separate  objects  of  the  external  world,  which 
are  the  subject  of  any  scientific  research.  In  scientific  documents 
those  objects  are  described  which  are  found  in  a  logical  bond  with 
each  other.  Therefore  the  words  which  designate  the  appropriate 
objects  are  found  in  the  same  logical  bond  with  each  other. 

Therefore it is possible to claim that words which are
in a logical bond are frequently used in the same documents in
various combinations with each other and much more rarely (or
not at all) together with other words. The various factors of
the semantic similarity of terms are intended precisely for the
measurement of the degree of this logical bond.

In examining natural language from these positions, the frequency
of repetition of terms is considered a significant measure of
their importance, which must be taken into account in the statistical
criterion of semantic similarity. Furthermore, the more extensive the
volume of the concept of a term, the more frequently it is encountered
in various combinations with other terms, the wider the frequency
range of its use in documents, and therefore the greater its dispersion.

Of all the known criteria of semantic similarity used in the automatic
construction of classifications, the only one that takes both of
these factors into account is the correlation factor. Furthermore, the
correlation factor readily reveals the type of dependence (positive or
negative bond). This circumstance makes it possible to view the
correlation factor as the most exact statistical measure of the
semantic similarity of terms.

Thus, the statistical representation of the criteria of semantic
similarity rests on the hypothesis that the degree
of the object-logical bond of terms can be expressed by means of
the statistical correlation of the frequencies of the combined
occurrence of terms in documents.
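
The hypothesis above can be illustrated by a minimal sketch which computes the correlation factor of two terms from their frequencies in the same documents. The term names and frequencies are invented for illustration, and returning zero for a term of constant frequency is an assumption of this sketch, not part of the source.

```python
from math import sqrt

def correlation(x, y):
    """Pearson correlation of two equal-length term-frequency vectors."""
    n = len(x)
    mean_x = sum(x) / n
    mean_y = sum(y) / n
    cov = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    var_x = sum((a - mean_x) ** 2 for a in x)
    var_y = sum((b - mean_y) ** 2 for b in y)
    if var_x == 0 or var_y == 0:
        return 0.0  # assumption: a constant term shows no co-variation
    return cov / sqrt(var_x * var_y)

# Hypothetical frequencies of three terms in five documents.
laser = [3, 0, 2, 0, 4]
optics = [2, 0, 1, 0, 3]
soil = [0, 4, 0, 3, 0]

print(correlation(laser, optics))  # positive bond
print(correlation(laser, soil))    # negative bond
```

Terms that occur together give a factor close to +1, and terms that avoid each other give a negative factor, which is the distinction between positive and negative bond mentioned in the text.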

A formal definition of class is not given in all the methods
for the automatic construction of classifications of terms.
G. Borko [1] and G. Borko and M. Bernik [2], using the method of
principal components, formed the classes of terms and chose their
necessary number intuitively, on the basis of an analysis of the
components obtained as a result of statistical analysis.
J. Williams [3], who used discriminant analysis, intuitively
constructed a system for the classification of documents, which, as a
result of statistical analysis, was converted into a classification
of terms.

The  formal  definition  of  class,  most  frequently  utilized  in  the 
automatic  methods  for  the  construction  of  classifications,  is  given 
thus:  a  class  is  that  set  of  elements,  the  mean  value  of  the 
factor  of  the  semantic  similarity  between  which  is  more  than  the  mean 
value  of  the  factor  of  the  semantic  similarity  between  the  elements 
of  the  set  and  the  elements  of  its  complement. 
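
A minimal sketch of this definition follows. It assumes a symmetric matrix of similarity factors given as a list of lists and a candidate class given as a list of term indices; the matrix values are invented for illustration.

```python
def is_class(sim, members):
    """Check the formal definition of class for a candidate set of terms.

    sim     -- symmetric matrix of similarity factors (list of lists)
    members -- indices of the terms in the candidate set
    """
    inside = set(members)
    outside = [j for j in range(len(sim)) if j not in inside]
    # Similarities between distinct elements of the set.
    internal = [sim[i][j] for i in members for j in members if i < j]
    # Similarities between elements of the set and its complement.
    external = [sim[i][j] for i in members for j in outside]
    if not internal or not external:
        return False  # assumption: a class needs two members and a complement
    mean_internal = sum(internal) / len(internal)
    mean_external = sum(external) / len(external)
    return mean_internal > mean_external

# Hypothetical similarity matrix for four terms.
sim = [
    [1.0, 0.8, 0.1, 0.0],
    [0.8, 1.0, 0.2, 0.1],
    [0.1, 0.2, 1.0, 0.7],
    [0.0, 0.1, 0.7, 1.0],
]
print(is_class(sim, [0, 1]))  # True: the pair is closer to itself than to the rest
print(is_class(sim, [0, 2]))  # False
```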

Such a definition was used, for example, by R. Needham and
V. Ovchinnikov. They used a fixed strategy: R. Needham [4]
constructed a nonhierarchical classification of terms in which
the number of classes was established by intuitive means; V.
Ovchinnikov [5] constructed a dichotomous nonintersecting
classification of terms.

There are two types of strategies: agglomerative and dividing.
The first forms a class by associating its individual elements;
the second separates a class from the entire initial set of elements,
or from the set present at the given moment.

Since such strategies are based on the matrix of semantic
similarity, the advantage of the dividing strategies becomes
obvious: at every given moment they use all the information
contained in the matrix of semantic similarity.


Any classification can be represented in a plane. For this, let
us designate by $T$ the initial set of terms

$$T = \{t_j\}, \qquad (1)$$

and let us consider the orthogonal coordinate system $Oe_1e_2$, in
which $e_1$ and $e_2$ are unit vectors. The position of any element
$t_j$ ($j = 1, 2, \ldots, n$) in such a system is uniquely determined by
coordinates $(R, L)$, where $R$ represents the number of the class or
subject heading, and $L$ the level of hierarchy ($R$ and $L$ are positive
integers). Then for any $t_j$, considered as a vector, the following
is correct:

$$t_j = R\,e_1 + L\,e_2. \qquad (2)$$

The initial set $T = \{t_1, \ldots, t_n\}$ satisfies the condition $R = L = 0$.

All the existing strategies have a number of degrees of freedom
not exceeding 1, i.e., they fix in advance the values of $R$ or $L$,
or of $R$ and $L$ simultaneously. Thus, for instance, if $L = 1$, then the
classification is nonhierarchical. We refer such classifications
to the methods of fixed strategies. It is evident from (2)
that methods of classification with the number of degrees of freedom
equal to 2 are possible, i.e., methods in which the values of $R$ and $L$
are not preset. We will refer such classifications to the
methods of free strategies.

From  the  examination  of  the  fundamental  methods  of  the  automatic 
construction  of  classifications  it  follows  that  they  do  not  completely 
algorithmize  the  procedure  for  the  construction  of  classifications 
(they  do  not  give  the  formal  definition  of  class,  or  they  do  not 
give  the  criteria  which  determine  the  necessary  number  of  classes  or 
the  optimum  number  of  levels  of  hierarchy).  This  deficiency 
substantially  lowers  the  effectiveness  of  the  application  of  computers 
for  the  creation  of  the  required  system  of  classification. 

2.  AUTOMATIC  CONSTRUCTION  OF  CLASSIFICATIONS
AND  AUTOMATIC  INDEXING 

One of the methods for increasing the effectiveness of an IRS
[information retrieval system] is the application (along with the
automatic construction of classifications) of automatic indexing which
uses these classifications. The introduction of such methods, which
ensure a sufficiently high degree of accuracy during indexing, will
make it possible to develop a fully automatic IRS, capable of
reconstructing the systems of classifications at short intervals and
thereby taking into account the changes in the informational
requirements of the users of the IRS.

Automatic indexing with the application of automatically
constructed systems of classifications was worked on by M. Maron [6],
G. Borko and M. Bernik [2], and J. Williams [3], who obtained an
accuracy of indexing of 51.8, 55.9, and 62.2% respectively.

Evidently the methods of automatic indexing with the application
of automatically constructed systems of classifications possess poor
accuracy even when the construction of the classification of terms
is preceded by the intuitive organization of the system for the
classification of documents.

3.  COMPLETELY  ALGORITHMIZED  METHODS  FOR 
THE  CONSTRUCTION  OF  CLASSIFICATIONS 

Having pointed out the deficiencies in the known methods for the
automatic construction of classifications, let us consider two
approaches which ensure the complete algorithmization of procedures
for the construction of classifications and an increase in the accuracy
of automatic indexing with the application of the resulting systems.

The  question  of  the  probability  of  error  during  indexing  arises 
when  it  is  necessary  to  refer  documents  to  one  or  several  subject 
headings.  It  is  possible  to  offer  various  methods  for  the  evaluation 
of  such  a  probability;  the  shortest  method  consists  of  the  following. 

Let us consider event $\alpha$, consisting of the assignment to the
document of one of the subject headings $g$ ($g = 1, 2, \ldots, m$). Let
us assume that all outcomes of this event are equally probable, that is

$$p(1) = p(2) = \ldots = p(m) = \frac{1}{m}. \qquad (3)$$

Then, using the entropy of event $\alpha$ as the measure of uncertainty of
indexing, we obtain

$$H(\alpha) = -\sum_{g=1}^{m} p(g) \log p(g) = \log m. \qquad (4)$$

It follows from (4) that the entropy of indexing has a minimum at
$m = 2$ (if $m = 1$, then classification is absent altogether).
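
The dependence of the entropy of indexing on the number of subject headings can be checked numerically with a short sketch. The base of the logarithm is not stated in the text; base 2 is assumed here, so the entropy is expressed in bits.

```python
from math import log2

def indexing_entropy(m):
    """Entropy of assigning one of m equally probable subject headings."""
    p = 1.0 / m
    return -sum(p * log2(p) for _ in range(m))

for m in (2, 4, 8, 16):
    print(m, indexing_entropy(m))
```

The entropy grows monotonically with the number of headings, so among genuine classifications (two or more headings) the dichotomy gives the least uncertainty.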

The presence of only two subject classes cannot always guarantee
the necessary depth of indexing. If we select a hierarchical
structure in which every class on any level of hierarchy divides
into two subclasses, then we will obtain a classification
which ensures the necessary number of classes and minimizes the
overall probability of error in indexing.

However, the dichotomous classification (with the definition of
the subject heading as a set of terms, the mean value of the semantic
similarity between which is higher than the mean value of their
similarity to the terms which form the complement of the set) can
be insufficient in the construction of classifications for many
subject areas.

It is possible to propose two methods for the elimination of
this deficiency. The first consists of the introduction of another
definition of the subject heading. Such a definition can be the
following: the subject heading is that set of terms of the smallest
power which has a negative total resemblance to the complement of the
set. Here the complement of the set forms another subject heading.
However, this approach is inapplicable in cases when the factors
of the similarity of the classified objects do not take negative
values.

The second method consists of the rejection of the dichotomy and
the construction of a classification with any necessary number of
subject headings. If we reject subjective analysis for the purpose
of the complete algorithmization of the procedure for the construction
of the classification, then it is necessary to introduce a criterion
which determines the necessary number of subject headings.

It is evident that a classification obtained in this way should
be reasonable from the point of view of man, i.e., possess an
informative capacity, since feedback is possible between an automatic
IRS and the user, when man can correct the operation of the system
based on intermediate results.

The informative capacity of the system of classification is
connected primarily with the information which it communicates to
the user. This information is determined in turn by the semantics
of the subject headings and by their structure. Individual terms, if
they are isolated, are indefinite. They become single-valued and
informative if they have been introduced into the system of
classifications, i.e., if they are the elements of a semantic field along with
other words, forming the context which determines the uniqueness of
every element of the semantic field.

The  semantic  information  which  is  carried  by  each  individual 
term  is  connected  with  its  unmarked  nature  and  therefore  it  can  be 
expressed  statistically  through  the  factor  of  activity.  Then  the 
semantic  information  of  the  subject  heading  can  be  defined  as  the 
sum  of  the  activities  of  all  terms  entering  into  it. 


The informative nature of the system of classification, on the
other hand, is determined by the possibility of its interpretation.
Experiments in the automatic construction of classifications
with the application of the theory of clumps showed that the system
of classification is more difficult to interpret, the greater the
common mutual intersection of subject headings.

Thus  the  informative  nature  of  the  system  of  classification  as 
the  function  of  its  structure  and  semantic  content  of  the  subject 
headings  is  proportional  to  the  total  activity  of  all  terms  which  form 
a  classification,  and  inversely  proportional  to  the  common  mutual 
intersection  of  subject  headings. 

Let $V$ be the initial set of terms ($j = 1, 2, \ldots, n$), during
the algorithmic division of which into groups $G_1, G_2, \ldots, G_p$ the
system of classification is constructed. The informative nature of the
classification is calculated in the following manner:

[formula (5)]

where the quantities entering the formula are the number of terms in
the $k$-th group ($k = 1, \ldots, p$) and the activity coefficient of
term $j$ entering into the $k$-th group.

At the first stages of the construction of a classification,
when nonintersecting or weakly intersecting groups are obtained, the
numerator in formula (5) increases faster than the denominator, and
the informative nature of the corresponding number of subject headings
increases. But as soon as strongly intersecting groups are formed,
the denominator in (5) begins to increase faster than the numerator
and the informative nature decreases. In this case the last formed
group is broken down and those subject headings are retained which
give the maximum of information.

For the evaluation of the activity (importance) of terms it is
advantageous to use their "impressiveness" or generality within the
limits of a certain subject area. Linguists call such
generality the unmarked nature of a sign and show that it is
directly dependent on the frequency of the sign. In compiling
minimum dictionaries, the selection of words is sometimes based on
the frequency of the word and the number of sources in which it was
met. Designating by $x_{ij}$ the frequency of term $j$ in document $i$,
let us determine the activity $A_j$ of the term by the following expression:

$$A_j = \frac{K_j \sum_{i=1}^{N} x_{ij}}{N}, \qquad (6)$$

where $K_j$ is the number of documents in which term $j$ occurs and $N$ is
the volume of the sample.
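
A minimal sketch of the activity coefficient follows. It assumes that expression (6) is read as the number of documents containing the term, multiplied by the total frequency of the term and divided by the volume of the sample; the frequencies are invented for illustration.

```python
def activity(freqs):
    """Activity of one term, under the assumed reading of expression (6).

    freqs -- frequencies of the term in each of the N documents of the sample
    """
    n_docs = len(freqs)                      # N, volume of the sample
    k = sum(1 for x in freqs if x > 0)       # K_j, documents containing the term
    return k * sum(freqs) / n_docs

# Hypothetical frequencies of two terms in five documents.
print(activity([3, 0, 2, 0, 4]))  # used in three documents
print(activity([0, 9, 0, 0, 0]))  # same total frequency, one document
```

With this reading, two terms of equal total frequency differ in activity according to how many documents they are spread over, which agrees with the use of both the frequency of a word and the number of its sources.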

Now it remains to find the criterion which determines the
optimum number of levels of hierarchy in the resulting system of
classification. To find this criterion, which we will call
the stop-rule, we use the syntagmatic aspect of language. It is
evident that in carrying out the serial division of the set of
terms it is advantageous to stop at those groups which, from the
point of view of the subject indexer, are maximally interpretable
headings. For the interpretation of a group it is necessary to
formulate certain information from its elements. In this case a
syntagmatic bond should exist between the terms entering into the
information. Psycholinguistic research has shown that the syntagmatic
bond extends over 4-5 words counting from the beginning of the
information. Thus if a group contains 4-5 terms, then from them
it is possible to form certain information between the elements of
which a syntagmatic bond will exist. If the number of terms $n$ in
a certain group is less than 4, then from them it is possible to
form only "incomplete" information, since it can be augmented by the
lacking elements so that a syntagmatic bond exists between all
elements (final simple information is formed). Groups with
more than 5 terms represent a totality of such pieces of simple
information, between which logical relationships of conjunction or
disjunction exist.
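
The stop-rule can be sketched as a simple decision on the size of a group. The function name and the three labels are hypothetical; the thresholds 4 and 5 are taken from the text, and treating larger groups as candidates for further division is the interpretation assumed here.

```python
def stop_rule(group_size, low=4, high=5):
    """Decide what to do with a group of terms of the given size."""
    if group_size > high:
        return "divide"      # a totality of several pieces of simple information
    if group_size >= low:
        return "stop"        # 4-5 terms: one interpretable heading
    return "incomplete"      # fewer than 4 terms: information can still be augmented

for size in (9, 5, 4, 2):
    print(size, stop_rule(size))
```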

Thus we proposed two paths for the completely algorithmic
construction of classifications: the first is based on the dichotomous
principle and is intended for the minimization of the probability
of error during indexing, and the second is based on the criterion
of information capacity and makes it possible to construct systems
of classifications which possess maximum interpretability. The
dichotomous principle together with the stop-rule forms a fixed
strategy and therefore has a limited field of application; the
principle of information capacity together with the stop-rule forms
a free strategy and is therefore universal.

4.  ALGORITHMS  OF  THE  AUTOMATIC  CONSTRUCTION
OF  CLASSIFICATIONS  OF  TERMS

The proposed method for the automatic construction of classifications
will be realized in algorithms which include: the method of
the initiation of groups, the method of the evaluation of group
density, and the stop-rule, which determines the level of hierarchy
at which the division of groups and factor analysis are terminated.

At  the  basis  of  the  method  of  the  initiation  of  groups  lie 
the  coefficients  of  activities  of  terms  which  are  determined  by 
formula  (6).  All  terms  are  ranked  in  the  order  of  their  decreasing 
activity  coefficients  and  for  the  formation  of  two  (or  m)  groups 
the  two  (or  m)  most  active  terms  are  used. 
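
The initiation of groups can be sketched as follows, assuming that the activity coefficients have already been calculated and are held in a dictionary keyed by term. The terms and values are invented for illustration.

```python
def pick_centers(activities, m=2):
    """Return the m most active terms, to serve as the centers of groups."""
    ranked = sorted(activities, key=activities.get, reverse=True)
    return ranked[:m]

# Hypothetical activity coefficients.
activities = {"laser": 5.4, "optics": 3.6, "soil": 2.8, "beam": 1.0}
print(pick_centers(activities, m=2))
```

Terms of equal activity keep their original order here, since Python's sort is stable; the text does not say how such ties should be resolved.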

The  centers  of  groups,  having  been  isolated  thus,  subsequently 
accomplish  their  growth  because  of  the  terms  having  the  greatest 
similarities  with  the  centers.  As  the  measure  of  similarity  let  us 
take  the  correlation  factor  which  makes  it  possible  to  consider  not 
only  the  appearance  of  terms  in  documents,  but  also  the  number  of  such 
appearances. When a correlation matrix $R$ exists for the terms, the measure
of group density is calculated from the following formula:

$$B = \frac{S / n_S}{T / n_T}, \qquad (7)$$

where $S$ is the sum of the pairwise correlations between the terms of
the group; $T$ is the sum of the absolute values of the pairwise
correlations of the terms of the group with the remaining terms; $n_S$ and
$n_T$ are the numbers of correlations in the sums $S$ and $T$ respectively.

The introduction of the absolute value in the denominator of (7)
makes it possible to extend this method to matrices containing negative
elements. At $B = 1$ the average value of the cross correlations of
the selected group is equal to the average correlation of these terms
with all others. Such terms cannot be considered as belonging to
each other, because they belong equally to all the other terms.
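
A minimal sketch of the measure of group density follows. It assumes that the measure is the ratio of the mean correlation within the group to the mean absolute correlation of the group with the remaining terms, which is what the condition B = 1 described above implies; the correlation matrix is invented for illustration.

```python
def group_density(corr, group):
    """Ratio of mean within-group correlation to mean absolute outside correlation.

    corr  -- symmetric correlation matrix (list of lists)
    group -- indices of the terms in the group
    """
    inside = set(group)
    rest = [j for j in range(len(corr)) if j not in inside]
    within = [corr[i][j] for i in group for j in group if i < j]
    across = [abs(corr[i][j]) for i in group for j in rest]
    if not within or not across:
        raise ValueError("need at least two terms in the group and one outside")
    s, t = sum(within), sum(across)
    if t == 0:
        return float("inf")  # assumption: no bond at all with the remaining terms
    return (s / len(within)) / (t / len(across))

# Hypothetical correlation matrix for four terms.
corr = [
    [1.0, 0.8, 0.1, 0.0],
    [0.8, 1.0, 0.2, -0.1],
    [0.1, 0.2, 1.0, 0.7],
    [0.0, -0.1, 0.7, 1.0],
]
print(group_density(corr, [0, 1]))  # well above 1: a dense group
print(group_density(corr, [0, 2]))  # below 1: the terms do not belong together
```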

In factor analysis the method of multiple groups [7] is known,
the essence of which consists of the following.

Let there be a sample of $N$ documents ($i = 1, 2, \ldots, N$),
the subject content of which is described by $n$ index terms ($j = 1,
2, \ldots, n$) with a sufficient degree of completeness. If we count in
each of the documents the frequency of use of each term, then we
will obtain the following matrix:

$$X = \lVert x_{ij} \rVert. \qquad (8)$$

If the frequencies of terms are standardized and centered,
then this is reduced to

$$Z = \lVert z_{ij} \rVert. \qquad (9)$$


The problem in the construction of a classification by the
statistical method is the reduction of the number of characteristics
of documents. For this purpose it is assumed that all terms ($j = 1,
2, \ldots, n$) are linearly dependent on $m$ hypothetical terms
(factors) $F_1, \ldots, F_m$ ($m < n$), the volume of concept of each of
which includes the volumes of the concepts of several initial terms, and a
linear model of the following type is selected:


$$z'_j = a_{j1}F_1 + a_{j2}F_2 + \ldots + a_{jm}F_m + a_jU_j, \qquad (10)$$

where $U_j$ is the factor of uniqueness of term $j$. For the sake of
simplicity we will subsequently designate $z'_j$ as $z_j$. There are $n$
equations of this type. The coefficients of the factors are
frequently called loads.

Equation (10) for the value of the term $z_j$ in a concrete document
$i$ ($i = 1, 2, \ldots, N$) can be written thus:

$$z_{ij} = a_{j1}F_{i1} + a_{j2}F_{i2} + \ldots + a_{jm}F_{im} + a_jU_{ij}. \qquad (11)$$

The dispersion of the normalized frequency of the term, on
condition that all factors are uncorrelated, can be presented in the
following manner:

$$1 = s_j^2 = a_{j1}^2 + a_{j2}^2 + \ldots + a_{jm}^2 + a_j^2. \qquad (12)$$

The components on the right represent the fractions of the unit
dispersion of the frequency of a term which are attributed to the
hypothetical terms and to the factor of uniqueness. The value $a_j^2$
indicates that fraction of the dispersion of the frequency of term $j$
which cannot be expressed through the correlations of the frequencies
of combined occurrence of terms, and it is called the uniqueness of
the term.


That part of the dispersion of the frequency of term $j$ which is
accounted for by the hypothetical terms can be represented by the sum
of the squares of the loads of term $j$ on the hypothetical terms and
is called the communality of term $j$:

$$h_j^2 = a_{j1}^2 + a_{j2}^2 + \ldots + a_{jm}^2. \qquad (13)$$
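
The split of the unit dispersion of a term into communality and uniqueness can be illustrated by a short sketch. It assumes uncorrelated hypothetical terms, as the text requires for this decomposition; the loads are invented for illustration.

```python
def communality(loads):
    """Sum of squared loads of one term on the hypothetical terms."""
    return sum(a * a for a in loads)

def uniqueness(loads):
    """Remaining fraction of the unit dispersion of a standardized term."""
    return 1.0 - communality(loads)

# Hypothetical loads of one term on two uncorrelated hypothetical terms.
loads = [0.6, 0.5]
print(communality(loads), uniqueness(loads))
```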

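Linear model (10) can be written for all terms in an expanded form:

$$\begin{aligned}
z_1 &= a_{11}F_1 + a_{12}F_2 + \ldots + a_{1m}F_m + a_1U_1, \\
z_2 &= a_{21}F_1 + a_{22}F_2 + \ldots + a_{2m}F_m + a_2U_2, \\
&\;\;\vdots \\
z_n &= a_{n1}F_1 + a_{n2}F_2 + \ldots + a_{nm}F_m + a_nU_n.
\end{aligned} \qquad (14)$$

Such a set of equations is called a factor set. In it the hypothetical
terms $F_p$ ($p = 1, 2, \ldots, m$) can be either correlated or uncorrelated.
The unique factors $U_j$ ($j = 1, 2, \ldots, n$) are always uncorrelated.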

Before beginning the analysis, it is necessary to replace the
values of the "self-correlations" of terms with their communalities.
As a rough estimate of the communalities it is possible to use the
values of the greatest correlations taken from the columns
of correlation matrix $R$. Usually the hypothetical terms obtained
as a result of the analysis of multiple groups are inclined to each
other. Therefore the fundamental concept is that of the
correlation matrix of hypothetical terms. Furthermore, because the
factors are correlated, the direct results of the analysis
should lead to two matrices: a factor set and a factor structure.
The first gives the coefficients of the hypothetical terms in a linear
description of the terms, and the second gives the correlations of the
terms with the hypothetical terms.

The essence of the method of multiple groups is the representation
of hypothetical terms by axes of reference passing through the
centroids of the appropriate groups of terms. Therefore it is
advantageous to consider the properties of such sums of terms (centroids).




Although the frequencies of the separate terms $z_j$ are standardized,
their sums are not necessarily standard. In the method of multiple
groups it is assumed that the hypothetical term $T_p$ passes through
the cluster of $n_p$ terms in group $G_p$:

$$T_p = \sum_{k \in G_p} z_k \quad (p = 1, 2, \ldots, m). \qquad (15)$$


The calculation of the dispersions and correlations of hypothetical
terms can be accelerated by finding specific preliminary sums
of correlations. The first of them is simply the sum of the
correlations of every term $z_j$ with all terms in each group $G_p$, namely:

$$v_{jp} = \sum_{k \in G_p} r_{jk} \quad (j = 1, 2, \ldots, n;\; p = 1, 2, \ldots, m), \qquad (16)$$

where the addition is done over $k$, which takes all the values of the
terms in group $G_p$. When $k = j$, the "self-correlation" is considered
equal to the communality, that is $r_{jj} = h_j^2$.

The sum of the correlations of all terms in group $G_p$
with all the terms in group $G_q$ (including the case $p = q$) has the form:

$$w_{pq} = \sum_{j \in G_p} v_{jq} \quad (p, q = 1, 2, \ldots, m). \qquad (17)$$

Sum (17) can be expressed in terms of the initial correlations in the
following manner:

$$w_{pq} = \sum_{j \in G_p} \sum_{k \in G_q} r_{jk} \quad (p, q = 1, 2, \ldots, m). \qquad (18)$$

If $r_{jj} = h_j^2$, then the dispersion of the hypothetical term can be
expressed as
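
$$s_{T_p}^2 = \sum_{j \in G_p} \sum_{k \in G_p} r_{jk}. \qquad (19)$$

Using (17), expression (19) can be simplified:

$$s_{T_p}^2 = w_{pp}. \qquad (20)$$

The correlation between two hypothetical terms $T_p$ and $T_q$, using the
definition, can be presented in the form of

$$r_{T_pT_q} = \frac{\sum_i T_{ip}T_{iq}}{N s_{T_p} s_{T_q}}, \qquad (21)$$

where the addition is done for $i$ from 1 to $N$. Using the previous formulas
we obtain

$$\frac{1}{N}\sum_i T_{ip}T_{iq} = w_{pq}. \qquad (22)$$

Now let us find the correlation between the two inclined hypothetical
terms:

$$r_{T_pT_q} = \frac{w_{pq}}{\sqrt{w_{pp}w_{qq}}}. \qquad (23)$$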


Then it is possible to determine the correlations of the terms with the
hypothetical terms (the inclined structure) through the previous sums
of simple correlations. The element of the structure $s_{jp}$ is the
correlation $r_{z_jT_p}$ of the term $z_j$ in standard form with the
hypothetical term $T_p$, the dispersion of which is determined by (19).
It turns out that

$$s_{jp} = \frac{v_{jp}}{\sqrt{w_{pp}}}. \qquad (24)$$
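
The chain of preliminary sums can be followed in a short sketch. It assumes the standard relations of the method of multiple groups: the sums of correlations of each term with each group, the sums between groups, the correlations between the group centroids, and the structure obtained by dividing the first sums by the standard deviations of the centroids. The names `v`, `w`, `phi` and `s` and the matrix values are assumptions of this sketch; the diagonal of the matrix already holds rough communalities (the greatest correlation in each column).

```python
from math import sqrt

def multiple_group_structure(r, groups):
    """Return (structure, phi) for a correlation matrix and a grouping of terms.

    r      -- correlation matrix with communalities on the diagonal
    groups -- list of groups, each a list of term indices
    """
    n, m = len(r), len(groups)
    # v[j][p]: sum of correlations of term j with all terms of group p.
    v = [[sum(r[j][k] for k in g) for g in groups] for j in range(n)]
    # w[p][q]: sum of correlations between all terms of groups p and q.
    w = [[sum(v[j][q] for j in groups[p]) for q in range(m)] for p in range(m)]
    # phi[p][q]: correlation between the centroids of groups p and q.
    phi = [[w[p][q] / sqrt(w[p][p] * w[q][q]) for q in range(m)] for p in range(m)]
    # s[j][p]: correlation of term j with the centroid of group p.
    s = [[v[j][p] / sqrt(w[p][p]) for p in range(m)] for j in range(n)]
    return s, phi

# Hypothetical matrix for four terms divided into two groups.
r = [
    [0.8, 0.8, 0.1, 0.0],
    [0.8, 0.8, 0.2, 0.1],
    [0.1, 0.2, 0.7, 0.7],
    [0.0, 0.1, 0.7, 0.7],
]
structure, phi = multiple_group_structure(r, [[0, 1], [2, 3]])
print(phi[0][1])      # the two hypothetical terms are only weakly inclined
print(structure[0])   # term 0 correlates mainly with the first hypothetical term
```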

For obtaining the inclined characteristics it is necessary to have the
linear descriptions of the terms as functions of the hypothetical
terms. The coefficients in these linear equations, i.e., the elements
of the set, are the coordinates of the points which represent the
terms relative to the inclined axes of the hypothetical terms.
The matrix of the set $P$ can be obtained from the known structure $S$ and
the correlations between the hypothetical terms $\Phi$ according to the
following equation:

$$P = S\Phi^{-1}. \qquad (25)$$

Thus we obtained all the data necessary for the characteristics of
the inclined hypothetical terms. However, it is advisable to carry
out the transition to orthogonal hypothetical terms because such
terms, after their interpretation as subject headings, should not
meet together in the same documents of the array. The
description of the document content by subject headings is thereby
considerably simplified.

Of all the possible transitions from an inclined to an orthogonal
coordinate system during the analysis of multiple groups a special
solution is used which possesses the following properties: the
first axis of the new system coincides with the axis of the first inclined
hypothetical term, and the second lies in one plane with the first two
inclined hypothetical terms and is perpendicular to the first. The
new third axis is perpendicular to the plane of the first two, etc.
Since the orthogonalization of factors is carried out pairwise each time,
for the sake of simplicity we will examine the case of two
factors.

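If the coordinates of term j in the system of inclined hypothetical terms $T_1$ and $T_2$ are designated by $b_{j1}$ and $b_{j2}$, and in the orthogonal system $F_1$ and $F_2$ by $a_{j1}$ and $a_{j2}$, then the relation between the two sets of coordinates, when $T_1$ and $F_1$ coincide, is expressed by the following equations

$$a_{j1} = b_{j1} + b_{j2}\cos\theta_{12}, \qquad a_{j2} = b_{j2}\sin\theta_{12} \tag{26}$$

where $\theta_{12}$ is the angle between the inclined hypothetical terms $T_1$ and $T_2$. Since $\cos\theta_{12} = r_{T_1T_2}$, (26) can be expressed in matrix form

$$(a_{j1}\;\; a_{j2}) = (b_{j1}\;\; b_{j2})\begin{pmatrix} 1 & 0 \\ r_{T_1T_2} & \sqrt{1 - r_{T_1T_2}^2} \end{pmatrix} \tag{27}$$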


The conversion matrix, which converts the coordinates of the inclined frame of reference into coordinates of the orthogonal system, can be obtained from the correlation matrix between the inclined hypothetical terms. Applying the square root method to the matrix $\Phi$, a matrix T is obtained which proves to be the matrix of conversion:

$$\Phi = T'T \tag{28}$$

Now we write expression (27) for the case of n terms and m hypothetical terms:

$$A = PT' \tag{29}$$

where A and P are n × m matrices $(a_{jp})$ and $(b_{jp})$ respectively, and $T'$ is the matrix of order m obtained by the square root method from $\Phi$. Although (29) ensures the transition to the desired orthogonal coordinates, it is advantageous to express the conversion in terms of the elements of the inclined structure because they are determined first in the course of analysis. In this case conversion (29) assumes the form

$$A = ST^{-1} \tag{30}$$


Matrix A gives the coordinates of terms in the system of orthogonal hypothetical terms. However, this solution has the deficiency that the axis of the first orthogonal hypothetical term is very close to the centroid of the first group of terms, while the axis of the second is too far removed from the centroid of the second group of terms.
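The two-factor case can be sketched numerically. The snippet below is only an illustration, assuming the usual geometric relation between inclined and orthogonal coordinates (the first axes coincide, and the correlation between the inclined axes equals the cosine of the angle between them); the function name and the sample numbers are hypothetical and do not come from the experiment.

```python
import math

def orthogonalize_pair(b1, b2, r12):
    """Convert inclined coordinates (b1, b2) of a term to orthogonal ones.

    r12 is the correlation between the two inclined hypothetical terms,
    taken here as the cosine of the angle between their axes.
    """
    if not -1.0 < r12 < 1.0:
        raise ValueError("r12 must lie strictly between -1 and 1")
    a1 = b1 + b2 * r12                     # projection on the shared first axis
    a2 = b2 * math.sqrt(1.0 - r12 * r12)   # component perpendicular to it
    return a1, a2

# Hypothetical inclined coordinates of one term and a hypothetical correlation.
a1, a2 = orthogonalize_pair(0.6, 0.3, 0.5)
```

The squared length of the term vector is the same in both systems, which gives a quick check on any implementation: a1² + a2² should equal b1² + b2² + 2·b1·b2·r12.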


Therefore the transition to such an orthogonal solution is of interest, in which the correlations with the factors of terms already having high coordinates in the system of orthogonal hypothetical terms $F_1, F_2, \ldots, F_m$ approach unity, and the correlations with the factors of terms having low coordinates approach zero.

The determination of new factors $B_1, B_2, \ldots, B_m$ satisfies the principle of economy (the decrease in the complexity of every term), which is expressed by the varimax criterion

$$V = n\sum_{p=1}^{m}\sum_{j=1}^{n}\left(\frac{b_{jp}}{h_j}\right)^4 - \sum_{p=1}^{m}\left(\sum_{j=1}^{n}\frac{b_{jp}^2}{h_j^2}\right)^2 \tag{31}$$

which it is necessary to maximize; here $h_j^2$ is the communality of term j.

Let us designate the normalized correlations of term $z_j$ with $F_1$ and $F_2$ by

$$x_j = a_{j1}/h_j, \qquad y_j = a_{j2}/h_j \tag{32}$$

and the normalized correlations with $B_1$ and $B_2$ by $X_j$ and $Y_j$. Then the orthogonal conversion can be written in the form

$$X_j = x_j\cos\varphi + y_j\sin\varphi, \qquad Y_j = -x_j\sin\varphi + y_j\cos\varphi \tag{33}$$

where $\varphi$ is the angle of turn in the plane $F_1F_2$. If we introduce the designations

$$u_j = x_j^2 - y_j^2,\quad v_j = 2x_jy_j,\quad A = \sum_j u_j,\quad B = \sum_j v_j,\quad C = \sum_j (u_j^2 - v_j^2),\quad D = 2\sum_j u_jv_j \tag{34}$$

then the angle of turn can be presented as

$$\tan 4\varphi = \frac{D - 2AB/n}{C - (A^2 - B^2)/n} \tag{35}$$

Having determined the angle of turn, let us pass from the normalized correlations $X_j$, $Y_j$ to the coordinates $b_{j1}$, $b_{j2}$ in the system of new orthogonal hypothetical terms $B_1$ and $B_2$ according to the following equations:

$$b_{j1} = h_jX_j, \qquad b_{j2} = h_jY_j \tag{36}$$

Thus it is possible to arrange all terms j (j = 1, 2, ..., n) in the space of hypothetical terms $B_1, B_2, \ldots, B_m$ using the coordinates $b_{j1}, b_{j2}, \ldots, b_{jm}$.
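The rotation of one pair of factors can be illustrated as follows. This is a minimal sketch of the standard pairwise varimax step, not the BESM-6 program: the function names are hypothetical, the inputs are assumed to be already normalized correlations, and `atan2` is used so that the angle selected is the one that maximizes (rather than minimizes) the criterion.

```python
import math

def varimax_angle(x, y):
    """Angle of turn for one pair of factors, from normalized correlations."""
    n = len(x)
    u = [xi * xi - yi * yi for xi, yi in zip(x, y)]
    v = [2.0 * xi * yi for xi, yi in zip(x, y)]
    A, B = sum(u), sum(v)
    C = sum(ui * ui - vi * vi for ui, vi in zip(u, v))
    D = 2.0 * sum(ui * vi for ui, vi in zip(u, v))
    numerator = D - 2.0 * A * B / n
    denominator = C - (A * A - B * B) / n
    # tan(4*phi) = numerator / denominator; atan2 keeps the correct quadrant.
    return math.atan2(numerator, denominator) / 4.0

def rotate(x, y, phi):
    """Apply the orthogonal conversion through the angle phi."""
    c, s = math.cos(phi), math.sin(phi)
    X = [xi * c + yi * s for xi, yi in zip(x, y)]
    Y = [-xi * s + yi * c for xi, yi in zip(x, y)]
    return X, Y

# Hypothetical data: a perfectly simple structure turned away by 0.3 rad.
alpha = 0.3
x = [math.cos(alpha), math.cos(alpha), -math.sin(alpha), -math.sin(alpha)]
y = [math.sin(alpha), math.sin(alpha), math.cos(alpha), math.cos(alpha)]
phi = varimax_angle(x, y)
X, Y = rotate(x, y, phi)
```

For this constructed example the recovered angle equals the 0.3 rad by which the simple structure was turned, and the rotated correlations return to ones and zeros. With m factors the step would be repeated over all pairs until the criterion stops increasing.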

5. CLASSIFICATIONS OBTAINED BY MACHINE

The initial data for the construction of the "term-document" matrix was a sample of 208 abstracts from the Reference Journal "Chemistry" [RZh "Khimiya"], relating to the chemistry and technology of sugar-beet production (excluding the production of lime and carbon dioxide and the reprocessing of wastes), and also more than 200 keywords describing the given subject area. The average size of the selected abstracts was about 20 lines.

The compilation of the dictionary of keywords was the first step in the construction of the thesaurus. Then the synonymy of keywords in the dictionary of the system was reduced by combining incomplete and thematic synonyms and antonyms into classes of conditional equivalence. The classes were given names, or descriptors, which were the most frequently encountered keywords.

For instance, the descriptor which received the name "Diffu" served for the designation of the following class of keywords, which are considered equivalent within the limits of the selected subject area: juice extraction, malting, leaching, recovery, extraction, crude. After the elimination of synonymy the dictionary consisted of 200 descriptors (terms). Each term represents a word form, a combination of word forms, a word, its truncated form, or a set of truncated forms. Truncation was carried out to reduce the volume of the dictionary, and was done so that the chain of letters forming the truncated form, when superimposed from left to right, was contained in all the word forms from which that truncated form was obtained.

For all documents retrieval patterns were made up. These were the sets of terms together with the number of their appearances in the documents. The compilation of retrieval patterns was carried out manually according to the following rule. If, on superimposing a certain term from left to right on any lexical unit of the text, it was found that the term coincides with the lexical unit or is contained in it, then the corresponding term was considered to be included in the document. In the compilation of retrieval patterns the bibliographical data of the primary document were not considered.
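The matching rule can be sketched in code, although in the experiment it was applied by hand. The sketch assumes that "superimposing from left to right" means the term must match the beginning of a word (a prefix match), and it handles only single-word terms; the function name, the sample terms, and the sample sentence are hypothetical.

```python
import re
from collections import Counter

def retrieval_pattern(text, terms):
    """Count how often each (possibly truncated) term occurs in the text.

    A term is counted when it coincides with a word or forms its beginning.
    Multi-word terms are not handled in this simplified sketch.
    """
    words = re.findall(r"[a-z]+", text.lower())
    pattern = Counter()
    for word in words:
        for term in terms:
            if word.startswith(term):
                pattern[term] += 1
    return pattern

# Hypothetical truncated terms and abstract text.
terms = ["diffus", "storage", "sugar"]
text = "Diffusion juice and sugar losses during storage; diffusers compared."
pattern = retrieval_pattern(text, terms)
```

Here the truncated term "diffus" is counted twice (for "diffusion" and "diffusers"), and the other two terms once each. Stacking such patterns for all documents gives the rows or columns of a term-document matrix.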

All the retrieval patterns make up the initial term-document matrix with the dimension 200 × 208. Since the machine processing of a matrix of such a size requires considerable time, it was reduced by eliminating from it the terms whose activity coefficients did not exceed 1. The remaining terms, used subsequently for the construction of the classification, formed a matrix of 91 × 208.

For the realization of the proposed procedures two programs were written in the assembly language of the BESM-6. The first program, which realizes the dichotomous principle of the construction of classification, contains 1400 one-address instructions; its block diagram is represented in Fig. 1. The flowchart of the second program, containing 1430 one-address instructions and accomplishing the automatic construction of a classification which maximizes information capacity, is represented in Fig. 2. The two block diagrams differ in some control instructions and in unit 2A (Fig. 2). The remaining units coincide and have the following functions.


Fig. 1. The flowchart of the program for construction of a dichotomous classification.


Unit 1 uses the initial data, placed in the term-document matrix with the dimension 91 × 208, for the construction of the correlation matrix of terms of the dimension 91 × 91. Unit 2, operating on the correlation matrix and the matrix of the activity coefficients of terms, separates the semantic centers of groups and accomplishes their growth by means of the calculation of B-coefficients. The results of the operation of this unit are the groups of terms, and also the correlation matrices of terms corresponding to these groups. Unit 3 unites the resulting groups of terms and their corresponding correlation matrices, preparing them for factor analysis, which is accomplished by unit 4.


Fig. 2. The flowchart of the program for the construction of a classification which maximizes information capacity.


The result of the operation of unit 4 is the correlation matrix of terms with factors and the correlation matrix between factors.

Unit 5 accomplishes the orthogonalization and analytical rotation of multiple factors.

Unit 2A (Fig. 2) is contained only in the second program and is intended for the comparison of the groups obtained as a result of the operation of unit 2, and the calculation of the criterion of information capacity.

In  the  construction  of  a  classification  by  the  dichotomous 
method  at  the  first  level  of  hierarchy  two  subject  headings  are 
formed.  This  is  not  sufficient  for  the  content  description  of  the 
selected  subject  area  because  they  do  not  contain  the  terms  relating 
to  the  chemistry  and  the  storage  of  sugar  beets. 


The criterion of information capacity facilitated the extraction of the groups of closely connected terms: on the first level of hierarchy the information capacity of the first group of terms (subject heading 3) comprised 20.6, and the information capacity of the first three groups (headings 3, 2, and 1) was 38.5. After the formation of the fourth group the total pairwise intersection of groups increased sharply and the information capacity was reduced to 20.4. Therefore the fourth group was destroyed and at this stage the formation of the first level of hierarchy was terminated.


The first experiments on the further division of the groups of terms showed that the subgroups being formed intersect very strongly. This led to two conclusions. In the first place, as the groups of terms become more specialized, their density or homogeneity, measured by the B-coefficient, should increase. Therefore the value of the B-coefficient was increased by 1 at every transition to the following level of hierarchy. In the second place, as centers in the formation of subgroups one should select terms which are less active than those used during the formation of groups.

The  appropriate  changes  were  introduced  into  the  program  for 
the  BESM-6  and  as  a  result  13  factors  were  obtained  together  with 
the  correlations  of  terms  with  factors  after  their  orthogonalization 
and  analytical  rotation.  About  4  minutes  of  machine  time  were 
required  for  obtaining  the  factors. 

The factors represent groups of terms arranged in order of decreasing correlation with all the elements of the group. These groups were analyzed with respect to the semantic content of the terms and the values of their correlations with the factors. The analysis was conducted for the purpose of giving the groups names equivalent in content to subject headings. This procedure, which is called the interpretation of factors and is subjective, is carried out on the basis of the semantics of those terms which have the highest correlations with the appropriate factors.

The subjectivity of interpretation does not contradict the requirement introduced by us previously for the complete algorithmization of the procedures for the construction of classifications, because it is conducted for the purpose of checking the accuracy of automatic indexing by subject headings with the application of the classification obtained.


As  an  example  Table  1  shows  one  of  the  13  groups  obtained 
containing  nine  terms  and  their  correlations  with  the  appropriate 
factor. 


Table 1.

Term                                    Correlation with factor
Carat                                   0.82
Storage-preservation                    0.75
Beet-root                               0.74
Ventilation                             0.70
Waste-damage                            0.49
Technolog-                              0.44
Sugar                                   0.36
Inver-, inversion, invert sugar, RV     0.35
Content                                 0.30

As  a  result  of  the  interpretation  of  this  group  the  following 
subject  heading  was  adopted. 

The  storage  of  sugar  beets.  Physicochemical  processes  during 
storage. 

The  subject  headings  and  their  hierarchical  structure  obtained 
during  the  interpretation  of  all  factors  are  given  in  Table  2. 


6.  AUTOMATIC  INDEXING  AND  ITS  RESULTS 


For checking the value of the classification obtained and, therefore, the suitability of statistical methods for the automatic construction of classifications, the automatic indexing of a control sample of 100 abstracts was carried out. The criterion of accuracy of automatic indexing was manual indexing, carried out by three subject indexers (specialists in the technology of sugar-beet production) independently of one another. The subject indexers referred every abstract to the most relevant subject heading or subheading, taken from Table 2. Every abstract was given that subject heading (subheading) which was assigned to it by no fewer than two subject indexers. The results obtained were considered absolutely exact and the results of automatic indexing were compared with them.
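The agreement rule for the three indexers amounts to a majority vote. A minimal sketch follows; the function name and the heading labels are hypothetical, and the source does not say how an abstract was treated when all three indexers disagreed, so the sketch simply returns `None` in that case.

```python
from collections import Counter

def consensus_heading(assignments, min_votes=2):
    """Return the heading chosen by at least min_votes indexers, else None."""
    heading, votes = Counter(assignments).most_common(1)[0]
    return heading if votes >= min_votes else None

# Hypothetical heading labels assigned by three indexers to two abstracts.
agreed = consensus_heading(["2.1", "2.1", "3"])
split = consensus_heading(["1", "2", "3"])
```

In the first case two of three indexers agree, so heading "2.1" is adopted; in the second no heading reaches two votes.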


In factor analysis the measure of the relevancy of the document to the subject heading is taken as the sum of the products of the numbers of appearances of terms in the document and their correlations with the subject heading. In our case such an approach is not applicable, since the classification is hierarchical, that is, a subheading always contains a smaller number of terms than the corresponding subject heading.
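The conventional factor-analytic measure described here (not the modified formula used in the experiment) can be sketched as a sum of products. The function name, the heading names, and the second heading's correlation are hypothetical; the three correlations for the first heading are taken from Table 1 purely for illustration.

```python
def relevance_scores(counts, correlations):
    """Sum of (term occurrences x term-heading correlation) per heading.

    counts: {term: number of appearances in the document}
    correlations: {heading: {term: correlation of the term with the heading}}
    """
    return {
        heading: sum(n * corr.get(term, 0.0) for term, n in counts.items())
        for heading, corr in correlations.items()
    }

# Hypothetical retrieval pattern of one document.
counts = {"storage": 2, "ventilation": 1, "sugar": 1}
correlations = {
    "storage of sugar beets": {"storage": 0.75, "ventilation": 0.70, "sugar": 0.36},
    "hypothetical other heading": {"sugar": 0.40},
}
scores = relevance_scores(counts, correlations)
best = max(scores, key=scores.get)
```

Because the score is a plain sum over terms, a heading with many terms tends to outscore its own subheadings, which have fewer terms; this is the difficulty with hierarchical classifications noted above.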

In the compilation of the algorithm of automatic indexing we proceeded from the following considerations. In the first place, all the subheadings should be reduced to the dimension of the appropriate headings. In the second place, the value of the relevancy of the document to the subject heading (subheading) should be greater, the greater the intersection of the retrieval pattern of the document with the subject heading (subheading).

Because  of  the  aforesaid,  the  calculation  of  the  value  of 
relevancy  RV  of  the  document  i  to  the  subject  heading  (subheading) 
p  is  accomplished  using  the  following  formula: 

*>v-3?S  *vc'«*  (37) 

i- : 


Here the quantities entering (37) are: the intersection of the retrieval pattern of the document with the subject heading (subheading) p; the number of terms of the retrieval pattern of the document; and the correlation of term j with the subject heading (subheading) p after the reduction of the dimensions of the subheading to the dimensions of the corresponding heading.

Let X be the document-term matrix, which represents the numbers of appearances of all terms in each of the documents of the sample; B the term-subject heading (subheading) correlation matrix, obtained as a result of factor analysis, orthogonalization and analytical rotation; and $\beta_p$ the coefficient of reduction of the subheading to the dimensions of the appropriate heading p. Then (37) can be rewritten in matrix form:


(38) 


The appropriate program of automatic indexing was written in the ALGOL language and realized on the BESM-6. The indexing of one document takes about 0.3 seconds of machine time. The computer correctly indexed 64 documents out of 99, so the accuracy of automatic indexing comprised 64.6%.

One ought to note especially that the 64 correctly indexed documents did not include 17 which were assigned by the machine not to a subheading (as in manual indexing) but to the corresponding subject heading. Thus the introduction of a more exact criterion of relevancy can increase the accuracy of automatic indexing with the application of the automatically obtained classification to at least 80%.
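The two percentages can be checked by direct arithmetic. The second line assumes that all 17 documents assigned to the parent heading would be counted as correct under a more exact criterion, which is the reading suggested by the text rather than a reported result.

```latex
\frac{64}{99} \approx 0.646 \quad (64.6\%), \qquad
\frac{64 + 17}{99} = \frac{81}{99} \approx 0.818 \quad (81.8\%)
```

The second figure is consistent with the stated bound of at least 80%.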

7.  THE  RELIABILITY  OF  MANUAL  INDEXING  AND 
AUTOMATIC  INDEXING. 

The accuracy of automatic indexing is determined in many respects by the concordance of the results of manual categorization. This circumstance makes it necessary to measure the reliability of manual indexing and to evaluate the accuracy of automatic indexing which would be obtained if manual indexing were absolutely reliable.

The result of any indexing is the attaching to the document of the most relevant subject heading or subheading, which are only nominal categories and do not have a quantitative expression. Therefore, for the measurement of the degree of concordance of the results of indexing, only those statistical criteria can be used which measure the closeness of the relation between qualitative characteristics.

As such a criterion we selected the coefficient of mutual contingency, which in a specific sense is the equivalent of the correlation coefficient. The criterion makes it possible to compare our results with the results obtained by H. Borko [1]. For the calculation of the coefficient of contingency between the results of indexing by the various subject indexers and by machine, the appropriate program was written in the ALGOL language and realized on the BESM-6. The results of the computations are given in Table 3; manual indexing is marked by the indices 1, 2, 3, and automatic indexing by the letter a.
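A coefficient of this kind can be sketched as follows. The source does not state which variant was programmed, so the sketch assumes Pearson's form, C = sqrt(χ² / (χ² + N)), computed from the cross-tabulation of two sets of heading assignments; the function name and the sample labels are hypothetical.

```python
import math
from collections import Counter

def contingency_coefficient(labels_a, labels_b):
    """Pearson's contingency coefficient between two nominal labelings."""
    n = len(labels_a)
    if n == 0 or n != len(labels_b):
        raise ValueError("label lists must be non-empty and of equal length")
    joint = Counter(zip(labels_a, labels_b))
    rows, cols = Counter(labels_a), Counter(labels_b)
    chi2 = 0.0
    for a, row_total in rows.items():
        for b, col_total in cols.items():
            expected = row_total * col_total / n
            observed = joint.get((a, b), 0)
            chi2 += (observed - expected) ** 2 / expected
    return math.sqrt(chi2 / (chi2 + n))

# Hypothetical assignments of four documents by two indexers.
same = contingency_coefficient(["A", "A", "B", "B"], ["A", "A", "B", "B"])
unrelated = contingency_coefficient(["A", "A", "B", "B"], ["X", "Y", "X", "Y"])
```

Note that in this form the coefficient cannot reach 1 even for identical labelings: with k equally frequent headings its maximum is sqrt((k − 1)/k), so for two headings identical assignments give about 0.707.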


Table 3. Correlation of the results of indexing.

Method    1              2              3         a
1         -              (illegible)    0.9371    (illegible)
2         (illegible)    -              0.9371    0.8946
3         0.9371         0.9371         -         0.8964
a         (illegible)    0.8946         0.8964    -

$R_{mm} = 0.9370$;  $R_{am} = 0.8988$


The reliability of manual indexing $R_{mm}$ is the mean of the coefficients of contingency $C_{12}$, $C_{13}$, $C_{23}$ and comprises 0.9370.

The correlation of the results of automatic indexing and manual indexing $C_{a1}$, $C_{a2}$, $C_{a3}$ is in all cases lower than the correlation of the results of manual indexing, and its mean $R_{am} = 0.8988$ indicates the measure of concordance between automatic indexing and manual indexing. It is necessary to note especially that the correlation of the results of automatic and manual indexing $R_{am} = 0.8988$ obtained by us is somewhat higher than the correlation between various subject indexers $R_{mm} = 0.870$ obtained by H. Borko [1]. This again confirms the effectiveness of the method selected by us for the automatic construction of classification and indexing.

If manual indexing were absolutely reliable, the correlation of automatic indexing with it would comprise $R_{aam} = R_{am}/R_{mm} = 0.9592$.

Now let us present the total number of documents being indexed in the form of the sum of two terms: the numbers of correctly and incorrectly indexed documents (with absolutely reliable manual indexing). The first value is determined by the coefficient of determination $d = R_{aam}^2 = 0.9200$. This means that if manual indexing were absolutely reliable, then with an automatically constructed classification and the described algorithm of indexing it would be possible to correctly index 92% of the total number of documents, which is considerably higher than in H. Borko's method (67%).
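The figures quoted in this section are mutually consistent if the correction for unreliable manual indexing is taken to be a simple ratio and the coefficient of determination to be its square. This reading is inferred from the numbers themselves, since the printed formulas are damaged, and should be treated as an assumption:

```latex
\frac{0.8988}{0.9370} \approx 0.9592, \qquad
(0.9592)^2 \approx 0.9200 \quad (92\%)
```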

CONCLUSIONS 

1. The results of the test show that the criteria proposed for the complete algorithmization of the construction of classifications (the criterion which defines the number of levels of hierarchy in the resulting classification, and the criterion of information capacity which determines the optimum number of subject headings on every level of hierarchy) permit the construction of systems of classifications which are easy to interpret and describe the appropriate subject area sufficiently fully.

2. The machine test on automatic indexing with the application of an automatically constructed system of classification confirms the effectiveness of the described method of construction of classifications, because the results were higher even in comparison with the methods of indexing which are preceded by an intuitively compiled classification of documents.